Missing data and multiple imputation

MACS 30200 University of Chicago

Causes of missingness

  • Surveys
  • Errors in data collection
  • Intentional
  • Censored values

Patterns of missingness

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
  • Why do we care?
    • Mechanism
    • Ignorable vs. non-ignorable
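
A small simulation can make the three mechanisms concrete. This is only a sketch on synthetic data: the variable names, sample size, and missingness probabilities are illustrative assumptions, not taken from the course materials.

```r
set.seed(30200)
n <- 10000
z <- rnorm(n)                          # fully observed covariate
x <- 1 + 0.8 * z + rnorm(n, sd = 0.6)  # variable that will lose values; true mean is 1

# Probability that x is missing under each mechanism (assumed functional forms)
p_mcar <- rep(0.3, n)           # MCAR: unrelated to any variable
p_mar  <- plogis(-1 + 1.5 * z)  # MAR: depends only on the observed z
p_mnar <- plogis(-1 + 1.5 * x)  # MNAR: depends on the value of x itself

# Mean of x among the cases that remain observed
observed_mean <- function(p_miss) {
  miss <- runif(n) < p_miss
  mean(x[!miss])
}

means <- c(full = mean(x),
           MCAR = observed_mean(p_mcar),
           MAR  = observed_mean(p_mar),
           MNAR = observed_mean(p_mnar))
round(means, 3)
```

In this setup the observed-case mean is close to the full-data mean only under MCAR. Under MAR the bias can be removed by conditioning on the observed `z`, which is what makes the mechanism ignorable; under MNAR it cannot be removed without modeling the missingness itself.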

Things to consider

  1. Does the method provide consistent estimates of the population parameters?
  2. Does the method provide valid statistical inferences?
  3. Does the method use the observed data efficiently or does it recklessly discard information?

Complete-case analysis

  • Listwise deletion
  • Advantages
  • Disadvantages

Available-case analysis

  • Pairwise deletion
  • Advantages
  • Disadvantages
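
The difference between listwise and pairwise deletion can be sketched with base R's `cor()`, which supports both through its `use` argument. The data frame `d` and its columns are hypothetical.

```r
set.seed(1)
n <- 200
d <- data.frame(a = rnorm(n))
d$b <- 0.5 * d$a + rnorm(n)
d$c <- 0.5 * d$b + rnorm(n)
d$a[sample(n, 40)] <- NA   # knock out 40 values of a
d$c[sample(n, 40)] <- NA   # and, independently, 40 values of c

# Listwise deletion: only rows with no missing values at all are used
n_listwise   <- sum(complete.cases(d))
cor_listwise <- cor(d, use = "complete.obs")

# Pairwise deletion: each correlation uses every row observed on that pair
obs          <- 1 * !is.na(as.matrix(d))
n_pairwise   <- crossprod(obs)   # number of jointly observed rows per pair
cor_pairwise <- cor(d, use = "pairwise.complete.obs")

n_listwise
n_pairwise
```

Pairwise deletion keeps more observations per statistic, but each entry of the correlation matrix is then based on a different subsample, so the matrix is not guaranteed to be positive semi-definite.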

Imputation

  • Imputation
  • Unconditional mean imputation
  • Conditional mean imputation
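
As a rough illustration of why single mean imputation distorts the data, the sketch below compares the two approaches on synthetic data. All names and parameter values are assumptions chosen for the example.

```r
set.seed(2)
n <- 500
z <- rnorm(n)
x <- 2 + z + rnorm(n)
miss  <- runif(n) < plogis(z)   # MAR: missingness depends on the observed z
x_obs <- ifelse(miss, NA, x)

# Unconditional mean imputation: every gap gets the observed mean
x_uncond <- ifelse(miss, mean(x_obs, na.rm = TRUE), x_obs)

# Conditional mean imputation: every gap gets its regression prediction
fit    <- lm(x_obs ~ z)         # lm() drops the incomplete rows by default
x_cond <- ifelse(miss, predict(fit, newdata = data.frame(z = z)), x_obs)

round(c(true = var(x), uncond = var(x_uncond), cond = var(x_cond)), 2)
round(c(true = cor(x, z), uncond = cor(x_uncond, z), cond = cor(x_cond, z)), 2)
```

Both versions understate the variance of `x`, and the conditional version overstates its correlation with `z`, because the imputed values carry no residual noise.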

Maximum-likelihood estimation

\[p(\mathbf{X}; \theta) = p(\mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{mis}}; \theta)\]

  • Data MAR

    \[p(\mathbf{X}_\text{obs}; \theta) = \int{p(\mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{mis}}; \theta)} d\mathbf{X}_{\text{mis}}\]

  • EM algorithm
  • Single model for all variables
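
The EM algorithm can be sketched for the simplest interesting case: a bivariate normal model in which `x` is fully observed and `y` is MAR given `x`. This is a minimal illustration under those assumptions, on synthetic data, not the implementation used by any particular package.

```r
set.seed(3)
n <- 1000
x <- rnorm(n)                           # fully observed
y <- 1 + 0.7 * x + rnorm(n, sd = 0.5)   # true mean of y is 1
miss  <- runif(n) < plogis(x)           # MAR: depends only on x
y_obs <- ifelse(miss, NA, y)

# Starting values from the complete cases
mu <- c(mean(x), mean(y_obs, na.rm = TRUE))
S  <- cov(cbind(x, y_obs), use = "complete.obs")

for (iter in 1:200) {
  # E-step: expected y and y^2 for the missing cases, given x and current estimates
  slope     <- S[1, 2] / S[1, 1]
  resid_var <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  ey  <- ifelse(miss, mu[2] + slope * (x - mu[1]), y_obs)
  ey2 <- ifelse(miss, ey^2 + resid_var, y_obs^2)

  # M-step: maximum-likelihood updates (divisor n) from the expected sufficient statistics
  mu_new <- c(mean(x), mean(ey))
  s_xy   <- mean(x * ey) - mu_new[1] * mu_new[2]
  S_new  <- matrix(c(mean(x^2) - mu_new[1]^2, s_xy,
                     s_xy, mean(ey2) - mu_new[2]^2), 2, 2)

  converged <- max(abs(c(mu_new - mu, S_new - S))) < 1e-8
  mu <- mu_new
  S  <- S_new
  if (converged) break
}

round(c(complete_case = mean(y_obs, na.rm = TRUE), em = mu[2]), 3)
```

Because the missingness depends on `x`, the complete-case mean of `y` is biased, while the EM estimate uses the observed `x` values for every case and should land much closer to the true mean.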

Predictive mean matching

  1. For cases with no missing data, estimate a linear regression of \(x\) on \(z\), producing a set of coefficients \(b\)
  2. Make a random draw from the posterior distribution of \(b\), producing a new set of coefficients \(b^*\)
  3. Using \(b^*\), generate predicted values for \(x\) for all cases
  4. For each case with missing \(x\), identify a set of cases with observed \(x\) whose predicted values are close to the predicted value for the case with missing data
  5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value
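
The five steps above can be sketched in base R as follows. Two simplifications are assumed here: the coefficient draw uses a normal approximation centered on the least-squares estimates and ignores uncertainty in the residual variance, and the donor pool size `k` is a hypothetical tuning parameter. The function name `pmm_once` is also made up for this example.

```r
set.seed(4)
n <- 300
z <- rnorm(n)
x <- 5 + 2 * z + rnorm(n)
x[sample(n, 60)] <- NA
miss <- is.na(x)

pmm_once <- function(x, z, k = 5) {
  miss <- is.na(x)
  # Step 1: regress x on z using the complete cases (lm drops NAs by default)
  fit <- lm(x ~ z)
  b   <- coef(fit)
  # Step 2: draw b* from N(b, vcov(fit)), a simplified posterior draw
  b_star <- drop(b + t(chol(vcov(fit))) %*% rnorm(length(b)))
  # Step 3: predicted values for all cases using b*
  pred <- drop(cbind(1, z) %*% b_star)
  # Steps 4 and 5: for each missing case, find the k observed cases with the
  # closest predictions and borrow the observed value of one chosen at random
  donors <- which(!miss)
  x_imp  <- x
  for (i in which(miss)) {
    nearest  <- donors[order(abs(pred[donors] - pred[i]))[seq_len(k)]]
    x_imp[i] <- x[nearest[sample.int(k, 1)]]
  }
  x_imp
}

x_imputed <- pmm_once(x, z)
summary(x_imputed[miss])
```

Every imputed value is a value that was actually observed for some other case, so the imputations automatically respect the range and granularity of the original variable.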

Non-parametric models

  • Random forests
  • Deep learning

Multiple imputation

  • Generate multiple imputed datasets
  • Bayesian multiple imputation
  • Account for uncertainty within and across the datasets

Conducting inference

\[\tilde{\beta}_j \equiv \frac{\sum_{l=1}^g B_j^{(l)}}{g}\]

\[\tilde{\text{SE}}(\tilde{\beta}_j) = \sqrt{V_j^{(W)} + \frac{g + 1}{g} V_j^{(B)}}\]

\[V_j^{(W)} = \frac{\sum_{l=1}^g \text{SE}^2(B_j^{(l)})}{g}\]

\[V_j^{(B)} = \frac{\sum_{l=1}^g (B_j^{(l)} - \tilde{\beta}_j)^2}{g-1}\]

where \(\text{SE}^2(B_j^{(l)})\) is the estimated sampling variance of \(B_j\) in imputed dataset \(l\)

Practical considerations for multiple imputation

  • Which variables to include
  • Transform variables to approximately normal
  • Adjust the imputed data to resemble the original data
  • Make sure the imputation model captures relevant features of the data
  • \(g\) doesn’t need to be large
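
One reason to transform before imputing can be sketched with a single stochastic regression imputation of a positive, right-skewed variable. The data are synthetic, the helper `impute_draw` is hypothetical, and it ignores parameter uncertainty, so it stands in for only one step of a real multiple-imputation procedure.

```r
set.seed(5)
n <- 400
z   <- rnorm(n)
gdp <- exp(7 + z + rnorm(n, sd = 0.5))   # strictly positive and right-skewed
gdp_obs <- gdp
gdp_obs[sample(n, 100)] <- NA

# Fill each gap with a regression prediction plus normal noise
impute_draw <- function(y, z) {
  fit  <- lm(y ~ z)                                   # fitted on observed cases
  mu   <- predict(fit, newdata = data.frame(z = z))
  draw <- mu + rnorm(length(y), sd = sigma(fit))
  ifelse(is.na(y), draw, y)
}

imp_raw <- impute_draw(gdp_obs, z)            # impute on the original scale
imp_log <- exp(impute_draw(log(gdp_obs), z))  # impute on the log scale, then back-transform

c(negative_raw = sum(imp_raw < 0), negative_log = sum(imp_log < 0))
```

Imputing on the raw scale may produce impossible negative values, whereas imputing on the log scale and back-transforming cannot.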

Infant mortality

Regression model

##                term estimate std.error statistic  p.value
## 1       (Intercept)   6.8840   0.29039     23.71 1.58e-31
## 2 log(GDPperCapita)  -0.2943   0.05765     -5.10 3.85e-06
## 3     contraception  -0.0113   0.00424     -2.66 1.01e-02
## 4   educationFemale  -0.0770   0.03378     -2.28 2.63e-02

Missingness

infantMortality    GDPperCapita   contraception educationFemale
              6              10              63             131
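
Per-variable counts like these can be obtained with `colSums(is.na(...))`. The sketch below uses a tiny made-up data frame, not the actual UN data.

```r
# Hypothetical toy data; column names mirror the example, values are invented
df <- data.frame(
  infantMortality = c(154, 32, 44, NA),
  GDPperCapita    = c(2848, NA, 1531, NA),
  contraception   = c(NA, NA, 52, NA)
)

n_missing  <- colSums(is.na(df))      # missing values per variable
n_complete <- sum(complete.cases(df)) # rows usable in a complete-case analysis
n_missing
n_complete
```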

Amelia

library(Amelia)
un.out <- amelia(as.data.frame(un), m = 5,
                 idvars = c("country", "region"))
## Warning: There are observations in the data that are completely missing. 
##          These observations will remain unimputed in the final datasets. 
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70
str(un.out$imputations)
## List of 5
##  $ imp1:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 1.86 NA ...
##   ..$ contraception         : num [1:207] -12 71.9 52 43.3 NA ...
##   ..$ educationMale         : num [1:207] 4.05 11.02 11.1 13 NA ...
##   ..$ educationFemale       : num [1:207] 0.319 10.646 9.9 12.313 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 3207 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 78.3 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 68.9 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 8.941 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 16.02 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp2:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 3.39 NA ...
##   ..$ contraception         : num [1:207] 17.3 50.8 52 19.8 NA ...
##   ..$ educationMale         : num [1:207] 7.2 7.72 11.1 12.77 NA ...
##   ..$ educationFemale       : num [1:207] 2.11 9.37 9.9 12.76 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 8316 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 90.1 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 60.4 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 2.955 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 3.52 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp3:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 3.35 NA ...
##   ..$ contraception         : num [1:207] -7.24 56.17 52 67.65 NA ...
##   ..$ educationMale         : num [1:207] 6.16 9.37 11.1 14 NA ...
##   ..$ educationFemale       : num [1:207] 3.58 10.19 9.9 13.66 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 3568 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 78.9 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 63 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 1.728 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 14.92 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp4:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 1.66 NA ...
##   ..$ contraception         : num [1:207] 19.9 35.3 52 66.9 NA ...
##   ..$ educationMale         : num [1:207] 6.52 5.18 11.1 13.19 NA ...
##   ..$ educationFemale       : num [1:207] 2.28 5.72 9.9 13.49 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 4049 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 89.5 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 34 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 16.589 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 19.43 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp5:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 2.18 NA ...
##   ..$ contraception         : num [1:207] 20.6 32.5 52 71.5 NA ...
##   ..$ educationMale         : num [1:207] 7.37 10.27 11.1 15.1 NA ...
##   ..$ educationFemale       : num [1:207] 4.22 10.08 9.9 16.11 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 15980 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 73.7 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 33.9 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 15.825 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 24.52 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  - attr(*, "class")= chr [1:2] "mi" "list"

MI scatterplot

Conducting inference

## # A tibble: 20 × 6
##       id              term estimate std.error statistic   p.value
##    <chr>             <chr>    <dbl>     <dbl>     <dbl>     <dbl>
##  1  imp1       (Intercept)  6.47310   0.16284     39.75  1.30e-96
##  2  imp1 log(GDPperCapita) -0.20206   0.02988     -6.76  1.47e-10
##  3  imp1     contraception -0.00480   0.00241     -2.00  4.72e-02
##  4  imp1   educationFemale -0.14254   0.01799     -7.92  1.60e-13
##  5  imp2       (Intercept)  6.44744   0.14350     44.93 8.28e-106
##  6  imp2 log(GDPperCapita) -0.20265   0.02722     -7.45  2.91e-12
##  7  imp2     contraception -0.00596   0.00206     -2.90  4.18e-03
##  8  imp2   educationFemale -0.13358   0.01461     -9.14  7.46e-17
##  9  imp3       (Intercept)  6.57374   0.15260     43.08 3.79e-103
## 10  imp3 log(GDPperCapita) -0.20811   0.02774     -7.50  2.00e-12
## 11  imp3     contraception -0.00507   0.00224     -2.26  2.48e-02
## 12  imp3   educationFemale -0.14579   0.01707     -8.54  3.42e-15
## 13  imp4       (Intercept)  6.49875   0.17864     36.38  1.29e-89
## 14  imp4 log(GDPperCapita) -0.21912   0.03250     -6.74  1.68e-10
## 15  imp4     contraception -0.00710   0.00228     -3.11  2.14e-03
## 16  imp4   educationFemale -0.11895   0.01596     -7.45  2.75e-12
## 17  imp5       (Intercept)  6.52708   0.16142     40.44  1.24e-97
## 18  imp5 log(GDPperCapita) -0.21800   0.03058     -7.13  1.84e-11
## 19  imp5     contraception -0.00650   0.00218     -2.98  3.27e-03
## 20  imp5   educationFemale -0.12604   0.01722     -7.32  6.09e-12

Conducting inference

##                term estimate std.error estimate.mi std.error.mi
## 1       (Intercept)   6.8840   0.29039     6.50402      0.16896
## 2 log(GDPperCapita)  -0.2943   0.05765    -0.20999      0.03097
## 3     contraception  -0.0113   0.00424    -0.00588      0.00247
## 4   educationFemale  -0.0770   0.03378    -0.13338      0.02064
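
The pooled columns can be checked by applying the combining rules by hand. The sketch below uses the five `log(GDPperCapita)` rows from the per-imputation table; the function name `pool_rubin` is made up for this example.

```r
# Combine g estimates and their standard errors using the formulas for
# the pooled coefficient and its standard error
pool_rubin <- function(est, se) {
  g         <- length(est)
  pooled    <- mean(est)                        # average of the g estimates
  v_within  <- mean(se^2)                       # within-imputation variance
  v_between <- sum((est - pooled)^2) / (g - 1)  # between-imputation variance
  c(estimate  = pooled,
    std.error = sqrt(v_within + (g + 1) / g * v_between))
}

# log(GDPperCapita) estimates and standard errors for imp1 to imp5
est <- c(-0.20206, -0.20265, -0.20811, -0.21912, -0.21800)
se  <- c(0.02988, 0.02722, 0.02774, 0.03250, 0.03058)

pooled <- pool_rubin(est, se)
round(pooled, 5)
```

Up to rounding, the result agrees with the `estimate.mi` and `std.error.mi` values reported for `log(GDPperCapita)`.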

Missingness map

Transforming variables

un_lite.out <- amelia(un_lite, m = 5,
                      logs = c("infantMortality", "GDPperCapita"),
                      sqrts = c("tfr"))
## Warning: There are observations in the data that are completely missing. 
##          These observations will remain unimputed in the final datasets. 
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81 82 83 84 85 86 87 88 89 90 91
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54

New model results

## # A tibble: 20 × 6
##       id              term estimate std.error statistic   p.value
##    <chr>             <chr>    <dbl>     <dbl>     <dbl>     <dbl>
##  1  imp1       (Intercept)   6.4507   0.16693     38.64  5.43e-95
##  2  imp1 log(GDPperCapita)  -0.2271   0.03471     -6.54  4.88e-10
##  3  imp1     contraception  -0.0101   0.00239     -4.23  3.56e-05
##  4  imp1   educationFemale  -0.0955   0.02010     -4.75  3.82e-06
##  5  imp2       (Intercept)   6.5306   0.17201     37.97  1.24e-93
##  6  imp2 log(GDPperCapita)  -0.3042   0.03822     -7.96  1.26e-13
##  7  imp2     contraception  -0.0146   0.00248     -5.89  1.56e-08
##  8  imp2   educationFemale  -0.0285   0.02346     -1.21  2.26e-01
##  9  imp3       (Intercept)   6.4457   0.17031     37.85  2.15e-93
## 10  imp3 log(GDPperCapita)  -0.2437   0.03737     -6.52  5.59e-10
## 11  imp3     contraception  -0.0153   0.00225     -6.83  9.83e-11
## 12  imp3   educationFemale  -0.0606   0.02074     -2.92  3.89e-03
## 13  imp4       (Intercept)   6.1839   0.16821     36.76  3.57e-91
## 14  imp4 log(GDPperCapita)  -0.1512   0.03652     -4.14  5.11e-05
## 15  imp4     contraception  -0.0109   0.00221     -4.91  1.90e-06
## 16  imp4   educationFemale  -0.1260   0.01892     -6.66  2.56e-10
## 17  imp5       (Intercept)   6.4678   0.15379     42.06 1.44e-101
## 18  imp5 log(GDPperCapita)  -0.2257   0.02989     -7.55  1.48e-12
## 19  imp5     contraception  -0.0137   0.00206     -6.65  2.73e-10
## 20  imp5   educationFemale  -0.0804   0.01562     -5.15  6.23e-07
##                term estimate std.error estimate.mi std.error.mi
## 1       (Intercept)   6.8840   0.29039      6.4158      0.22183
## 2 log(GDPperCapita)  -0.2943   0.05765     -0.2304      0.06954
## 3     contraception  -0.0113   0.00424     -0.0129      0.00341
## 4   educationFemale  -0.0770   0.03378     -0.0782      0.04483